Abstract

Given a large un-transcribed corpus of speech utterances, we address
the problem of how to select a good subset for word-level transcription
under a given fixed transcription budget. We employ submodular active
selection on a Fisher-kernel based graph over un-transcribed utterances.
The selection is theoretically guaranteed to be near-optimal. Moreover,
our approach is able to bootstrap without requiring any initial
transcribed data, whereas traditional approaches rely heavily on the
quality of an initial model trained on some labeled data. Our
experiments on phone recognition show that our approach significantly
outperforms both average-case random selection and uncertainty sampling.

Index Terms: Transcription, labeling, submodularity, submodular
selection, active learning, sequence labeling, phone recognition,
speech recognition

1.  Introduction 

In automatic speech recognition and many other language applications,
unlabeled data are abundant but labels (e.g., transcriptions) are
expensive and time-consuming to acquire. For example, large amounts of
speech data can easily be obtained via telephone calls, and via modern
voice-based applications such as Microsoft's Tellme and Google's voice
search. Ideally, it would be possible to label all of this data for use
as a training set in a speech recognition system, as aptly conveyed by
the well-known phrase "there is no data like more data." Unfortunately,
this would be impractical given the ever increasing amount of available
unlabeled data. Accurate phonetic transcription of speech utterances
requires phonetic training, and even then it may take a month to
annotate 1 hour of speech [1], not to mention the difficulty of
transcribing at the articulatory level. Partly due to this, such
low-level transcription efforts have been sidelined by the community in
favor of word-level transcriptions. But even word-level transcriptions
are time-consuming (about 10 times real time), especially for
conversational spontaneous speech. This problem is particularly acute
for underrepresented languages or dialects with few speakers, where
linguistic experts are even harder to find.

In this paper, we address the following question: given limited
resources (time and/or budget), how can we optimally select a training
data subset for transcription such that the resulting system has
optimal performance? This is a well-known problem that goes by the name
of batch active learning, where a subset of data that is most
informative and representative of the whole is selected for labeling.
Often, examples are queried in a greedy fashion according to an
informativeness measure used to evaluate all examples in the pool. Two
popular strategies for measuring informativeness are uncertainty
sampling and the query-by-committee approach. Uncertainty sampling [2]
is the simplest and most commonly used strategy. In this framework, an
initial system is trained, typically using a small set of labeled
examples. Then, the system examines the rest of the unlabeled examples
and queries the examples that it is most uncertain about. Uncertainty
can be measured either by entropy or by a confidence score.
Query-by-committee also starts with labeled data. A set of distinct
models is trained as committee members. Each committee member is then
allowed to vote on the labelings of the unlabeled examples. The most
informative example is taken to be the one the committee most disagrees
about.

It has been shown that both uncertainty sampling and query-by-committee
may fail when they tend to query outliers, which is the main motivating
factor for other strategies such as estimated error reduction. The
problem is that outliers might have high uncertainty (or a committee
might find them controversial) but they are not good surrogates for
"typical" samples. Indeed, an ideal selection strategy should choose a
subset of samples that, when considered together, constitute in some
form a good representation of the entire training data set. Several
methods address this problem, all of which have been shown to be
superior to methods that do not consider representativeness measures.
Our approach herein also belongs to this category. In particular, we
use a Fisher kernel (Section 4) to build a graph over the unlabeled
sample sequences, and optimize submodular functions (to be defined)
over the graph to find the most representative subset. Note that our
Fisher kernel is over an unsupervised generative model, which enables
us to bootstrap our active learning approach without needing any
initial labeled data, yet we achieve good performance (see Section 5),
perhaps because of the approximate optimality of our submodular
procedures. This approach bodes well for underrepresented languages for
which an initial labeled set might be unavailable.

Despite extensive prior studies of active learning, there is relatively
little work on active learning for sequence labeling. Several methods
have been proposed, most of which are based either on uncertainty
sampling or query-by-committee. Confidence scores from a speech
recognizer have been used to indicate the informativeness of speech
utterances. Other active learning methods select the most uncertain
examples based on an EM-style algorithm for learning HMMs from
partially labeled data. In [18], several objective functions and
algorithms are introduced for active learning in HMMs. Several new
query strategies for probabilistic sequence models are introduced in
[3], and an empirical analysis is conducted on a variety of benchmark
datasets. Our approach can be distinguished from these methods in that
we select the most representative subset in a submodular framework,
where submodularity theoretically guarantees that the selection problem
can be solved efficiently and near-optimally (see Section 2, Theorem 1
and Theorem 2). Submodularity has already been successfully used in
active learning tasks. Robust submodular observation selection has been
explored, and Fisher information matrices have been related to
submodular functions so that the optimization can be done efficiently
and effectively. To the best of our knowledge, our approach is the
first work that incorporates submodularity for active learning in
sequence labeling tasks such as speech recognition.


2.  Background

2.1.  Submodularity

Consider a set function $z : 2^V \to \mathbb{R}$, which maps subsets
$S \subseteq V$ of a finite set $V$ to real numbers. Intuitively, $V$ is
the set of all unlabeled utterances, and the function $z(\cdot)$ scores
the quality of any chosen subset. $z(\cdot)$ is called submodular [20]
if for any $S, T \subseteq V$,

$$z(S \cup T) + z(S \cap T) \le z(S) + z(T). \tag{1}$$

An equivalent condition for submodularity is the property of
diminishing returns. That is, for any $R \subseteq S \subseteq V$ and
$s \in V \setminus S$,

$$z(S \cup \{s\}) - z(S) \le z(R \cup \{s\}) - z(R). \tag{2}$$

Intuitively, this means that adding an element $s$ helps at least as
much when we add it to a smaller set $R$ as when we add it to the
superset $S$. Submodularity is the discrete analog of convexity [20].
Just as convexity makes continuous functions more amenable to
optimization, submodularity plays an essential role in combinatorial
optimization. Submodular functions appear in many important settings,
including graph cut, set covering, and facility location problems [23].

2.2.  Submodular Selection

We want to select a good subset $S$ of training data $V$ that maximizes
some objective function, such that the size of $S$ is no larger than
$K$ (our budget). That is, we wish to compute:

$$\max_{S \subseteq V} \{ z(S) : |S| \le K \}. \tag{3}$$

While NP-hard, this problem can be approximately solved using a simple
greedy forward-selection algorithm. The algorithm starts with
$S = \emptyset$ and iteratively adds the element $s \in V \setminus S$
that maximally increases the objective function value, i.e.,

$$\hat{s} = \arg\max_{s \in V \setminus S} z(S \cup \{s\}), \tag{4}$$

until $|S| = K$. In fact, when $z(\cdot)$ is a nondecreasing and
normalized submodular set function, this simple greedy algorithm
performs near-optimally, as guaranteed by the following theorems.

Theorem 1. (Nemhauser et al. 1978 [24]) If a submodular function
$z(\cdot)$ satisfies: i) nondecreasing: for all
$S_1 \subseteq S_2 \subseteq V$, $z(S_1) \le z(S_2)$; ii) normalized:
$z(\emptyset) = 0$, then the set $S_G$ obtained by the greedy algorithm
is no worse than a constant fraction $(1 - 1/e)$ away from the optimal
value, i.e.,

$$z(S_G) \ge \left(1 - \frac{1}{e}\right) \max_{S \subseteq V : |S| \le K} z(S).$$

The greedy algorithm, moreover, is likely to be the best we can do in
polynomial time, unless P = NP.

Theorem 2. (Feige 1998 [22]) Unless P = NP, there is no polynomial-time
algorithm that guarantees a solution $S^*$ with

$$z(S^*) \ge (1 - 1/e + \epsilon) \max_{|S| \le K} z(S), \quad \epsilon > 0. \tag{5}$$

3.  Submodular Selection

Batch active learning problems are often cast as data subset selection,
where the active learner can ask for the labels of a subset of data
whose size is within budget and that is most likely to yield the most
accurate classifier. Problem (3) can also be viewed as a data selection
problem. Suppose we have a set of unlabeled training examples
$V = \{1, 2, \ldots, N\}$, where certain pairs $(i, j)$ are similar and
the similarity of $i$ and $j$ is measured by a nonnegative value
$w_{i,j}$. We can represent the unlabeled data using a graph
$G = (V, E)$, with nonnegative weights $w_{i,j}$ associated with each
edge $(i, j)$. The data selection problem is to find a subset $S$ that
is most representative of the whole set $V$, given the constraint
$|S| \le K$. To measure how "representative" $S$ is of the whole set
$V$, we introduce several submodular set functions.

3.1.  Submodular Set Functions

Our first objective is the uncapacitated facility location function:

Facility location:

$$z_1(S) = \sum_{i \in V} \max_{j \in S} w_{i,j}. \tag{6}$$

It measures the similarity of $S$ to the whole set $V$. We can also
measure the similarity of $S$ to the remainder, i.e., the graph cut
function:

Graph cut:

$$z_2(S) = \sum_{i \in V \setminus S} \sum_{j \in S} w_{i,j}. \tag{7}$$

Both of these functions are submodular, as seen by verifying inequality
(2) (proof omitted due to space limitations).

In order to apply Theorem 1, the objective function should also satisfy
the nondecreasing property. Obviously, the facility location objective
function is nondecreasing. For the graph cut objective, the increment
of adding $k$ into $S$ is

$$z_2(S \cup \{k\}) - z_2(S) = \sum_{i \in V \setminus S} w_{i,k} - \sum_{j \in S \cup \{k\}} w_{k,j},$$

which is not always nonnegative. Fortunately, the proof of Theorem 1
does not use the monotone property for all possible sets [24],
[19, page 58]. The graph cut can also meet the conditions for Theorem 1
if $|S| \ll |V|$, which is usually the case in applications where we
have a large amount of data but only limited resources for labeling.

With the above objectives, we can use the greedy algorithm to solve the
data selection problem efficiently and near-optimally. The greedy
algorithm for submodular data selection with the facility location
objective is described in Algorithm 1, where
$\rho_i = \max_{j \in S} w_{i,j}$ is updated incrementally to speed up
the algorithm. The graph-cut objective algorithm is similar and is
omitted to conserve space.
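As a small illustration of the two objectives, the following sketch evaluates the facility location and graph cut functions on a toy weight matrix and checks the diminishing-returns inequality by brute force. The function names, the dense-matrix representation, and the random toy graph are assumptions made for illustration; they are not taken from the original experiments.

```python
import itertools
import numpy as np

def facility_location(W, S):
    """z1(S) = sum over i in V of max over j in S of W[i, j]; 0 for the empty set."""
    S = list(S)
    if not S:
        return 0.0
    return float(W[:, S].max(axis=1).sum())

def graph_cut(W, S):
    """z2(S) = sum over i in V \\ S and j in S of W[i, j]."""
    S = list(S)
    in_S = set(S)
    rest = [i for i in range(W.shape[0]) if i not in in_S]
    if not S or not rest:
        return 0.0
    return float(W[np.ix_(rest, S)].sum())

def has_diminishing_returns(z, W, tol=1e-9):
    """Brute-force check of the diminishing-returns inequality for all
    R subset of S subset of V and s outside S (only feasible for tiny V)."""
    n = W.shape[0]
    subsets = [frozenset(c) for r in range(n + 1)
               for c in itertools.combinations(range(n), r)]
    for S in subsets:
        for R in subsets:
            if not R <= S:
                continue
            for s in range(n):
                if s in S:
                    continue
                gain_S = z(W, S | {s}) - z(W, S)
                gain_R = z(W, R | {s}) - z(W, R)
                if gain_S > gain_R + tol:
                    return False
    return True

# Hypothetical toy graph: symmetric, nonnegative weights over 5 "utterances".
rng = np.random.default_rng(0)
A = rng.random((5, 5))
W = (A + A.T) / 2.0

print(has_diminishing_returns(facility_location, W))
print(has_diminishing_returns(graph_cut, W))
```

The exhaustive check is exponential in the size of V, so it serves only as a sanity test on very small graphs.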


Algorithm 1  Greedy algorithm for facility location objective
Input: $G = (V, E)$ with weights $w_{i,j}$ on edge $(i, j)$; $K$: the
    number of examples to be selected
Initialization: $S = \emptyset$, $\rho_i = 0$, $i = 1, \ldots, N$, where $N = |V|$
while $|S| < K$ do
    $k^* = \arg\max_{k \in V \setminus S} \sum_{i \in V, (i,k) \in E} (\max\{\rho_i, w_{i,k}\} - \rho_i)$
    $S = S \cup \{k^*\}$
    for all $i \in V$ do
        $\rho_i = \max\{\rho_i, w_{i,k^*}\}$
    end for
end while
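The greedy procedure of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the graph is stored as a dense N-by-N array of nonnegative similarities with zeros for absent edges, and the function name and the toy instance are hypothetical. The brute-force comparison at the end is only there to illustrate the (1 - 1/e) guarantee of Theorem 1 on a tiny example.

```python
import itertools
import numpy as np

def greedy_facility_location(W, K):
    """Greedy selection for z1(S) = sum_i max_{j in S} W[i, j].

    Assumes W is an (N, N) array of nonnegative similarities (0 for
    missing edges). rho[i] caches max_{j in S} W[i, j].
    Returns the selected indices in order and the objective value.
    """
    N = W.shape[0]
    rho = np.zeros(N)
    remaining = np.ones(N, dtype=bool)
    selected = []
    while len(selected) < min(K, N):
        # gains[k] = sum_i (max(rho_i, W[i, k]) - rho_i)
        gains = np.maximum(W, rho[:, None]).sum(axis=0) - rho.sum()
        gains[~remaining] = -np.inf   # never re-select an element
        k_star = int(np.argmax(gains))
        selected.append(k_star)
        remaining[k_star] = False
        rho = np.maximum(rho, W[:, k_star])
    return selected, float(rho.sum())

# Hypothetical toy instance with 8 items and a budget of 3.
rng = np.random.default_rng(1)
A = rng.random((8, 8))
W = (A + A.T) / 2.0
S_greedy, z_greedy = greedy_facility_location(W, K=3)

# Exhaustive optimum, feasible only because the instance is tiny.
z_best = max(W[:, list(c)].max(axis=1).sum()
             for c in itertools.combinations(range(8), 3))
print(S_greedy, z_greedy / z_best)
```

Caching the per-node maxima keeps each iteration at O(N^2) for a dense matrix; a sparse graph would only need to visit the neighbors of each candidate.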


4.  Fisher  Kernel 

We express the pairwise "similarity" between the utterances $i$ and $j$
in terms of a kernel function $\kappa(i, j)$, so that
$w_{i,j} = \kappa(i, j)$. Since the examples are sequences with possibly
different lengths, we use the Fisher kernel [23], which is applicable to
variable-length sequences. Consider a generative model (e.g., a hidden
Markov model or, more generally, a dynamic Bayesian network (DBN)) with
parameters $\theta$ that models the generation process of the sequence.
Denote $X_i = (x_{i,1}, \ldots, x_{i,T_i})$ as the $i$-th feature
sequence with length $T_i$. Then a fixed-length vector, known as the
Fisher score, can be extracted as:

$$U_i = \frac{\partial}{\partial \theta} \log p(X_i \mid \theta). \tag{8}$$

Each component of $U_i$ is a derivative of the log-likelihood score for
the sequence $X_i$ with respect to a particular parameter; the Fisher
score is thus a vector having the same length as the number of
parameters $\theta$. The computation of gradients in Eq. (8) in the
context of DBNs is described in detail in [26].
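To make the Fisher score concrete, the sketch below computes it for a deliberately simplified stand-in model: frames drawn independently from a single diagonal-covariance Gaussian. This is an assumption chosen because the gradient has a closed form; the experiments in this paper use HMMs implemented as DBNs, for which the gradient computation is more involved. The function names are hypothetical.

```python
import numpy as np

def gaussian_loglik(X, mu, var):
    """log p(X | mu, var) for i.i.d. frames under a diagonal Gaussian."""
    return float((-0.5 * np.log(2 * np.pi * var)
                  - 0.5 * (X - mu) ** 2 / var).sum())

def gaussian_fisher_score(X, mu, var):
    """Gradient of gaussian_loglik with respect to (mu, var).

    X: (T, D) frames of one utterance; mu, var: (D,) parameters.
    The result has length 2 * D regardless of the number of frames T.
    """
    diff = X - mu
    d_mu = (diff / var).sum(axis=0)
    d_var = (-0.5 / var + 0.5 * diff ** 2 / var ** 2).sum(axis=0)
    return np.concatenate([d_mu, d_var])

# Two hypothetical utterances of different lengths map to same-size vectors.
rng = np.random.default_rng(0)
mu = np.array([0.1, -0.2, 0.3])
var = np.array([1.0, 0.5, 2.0])
X_short = rng.normal(size=(7, 3))
X_long = rng.normal(size=(20, 3))
U_short = gaussian_fisher_score(X_short, mu, var)
U_long = gaussian_fisher_score(X_long, mu, var)
print(U_short.shape, U_long.shape)
```

The point of the example is only that sequences of different lengths are mapped to vectors whose length equals the number of model parameters.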

Given Fisher scores, sequences with different lengths can be
represented by fixed-length vectors, so we can easily define several
Fisher kernel functions to measure pairwise similarity, e.g., cosine
similarity, radial-basis function (RBF) kernel similarity, or, as shown
below, the negative $\ell_1$ similarity:

Negative $\ell_1$ norm:

$$\kappa(i, j) = -\| U_i - U_j \|_1. \tag{9}$$

The generative model that is used to generate the Fisher score may
contain several types of parameters (e.g., discrete conditional
probability tables and continuous Gaussian parameters), and the values
associated with different types of parameters may have quite different
numeric dynamic ranges. In order to reduce the heterogeneity within the
Fisher score vector, all our experiments apply the following global
variance normalization to produce the final Fisher score vectors $U'_i$:

$$U'_i = (\mathrm{diag}(\Sigma))^{-1/2} \cdot (U_i - \bar{U}), \tag{10}$$

where $\bar{U} = \frac{1}{N} \sum_{i=1}^{N} U_i$ and
$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (U_i - \bar{U})^T (U_i - \bar{U})$.
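The normalization and the negative $\ell_1$ kernel can be sketched as follows, assuming the Fisher scores are stacked as rows of an N-by-P array. The function names and the small `eps` guard against zero variance are assumptions. The final shift to nonnegative edge weights is also a hypothetical choice: the text requires nonnegative weights but does not spell out how the nonpositive kernel values are mapped to them.

```python
import numpy as np

def normalize_fisher_scores(U, eps=1e-12):
    """Global variance normalization: subtract the mean score and divide
    each dimension by its standard deviation (1/N convention)."""
    U = np.asarray(U, dtype=float)
    mean = U.mean(axis=0)
    var = U.var(axis=0)   # diagonal of the covariance matrix
    return (U - mean) / np.sqrt(var + eps)

def negative_l1_kernel(U):
    """kappa(i, j) = -||U_i - U_j||_1 for all pairs (dense N x N result)."""
    return -np.abs(U[:, None, :] - U[None, :, :]).sum(axis=2)

def to_nonnegative_weights(kappa):
    """Assumption, not from the paper: shift so that all weights are >= 0."""
    return kappa - kappa.min()

# Hypothetical scores whose dimensions have very different dynamic ranges.
rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4)) * np.array([1.0, 100.0, 0.01, 10.0])
U_norm = normalize_fisher_scores(U)
kappa = negative_l1_kernel(U_norm)
W = to_nonnegative_weights(kappa)
```

The pairwise computation materializes an N x N x P intermediate array, so for a large corpus it would need to be done in blocks.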

5.  Experiments 

We evaluated our methods on a phone recognition task using the TIMIT
corpus. Random selection was used as a baseline. Specifically, we
randomly take p% of the TIMIT training set, where p = 2.5, 5, 10, 20,
30, 40, 50, 60, 70, 80, 90. For each subset, a 3-state
context-independent (CI) hidden Markov model (HMM) (implemented as a
DBN) was trained for each of the 48 phones. The number of Gaussian
components in the Gaussian mixture model (GMM) was optimized according
to the amount of training data available. The 48 phones were then
mapped down to 39 phones for scoring purposes following standard
practice [22]. Recognition was performed using standard Viterbi search
without a phonetic language model. We did not use a language model,
both to emphasize the acoustic modeling performance and to speed up
experimental turnaround by avoiding tedious language model scaling and
penalty parameter tuning when large random selection experiments are
performed. 100 trials of random selection experiments were performed
for each of the percentages above. The average phone error rates (PER)
were calculated and used as the baseline. The standard deviation was
around 0.01 for small p and about 0.005 for larger p. Apart from the
data selection strategy, experiments on uncertainty sampling and
submodular selection followed exactly the same setup as random
selection.

Figure 1: Relative improvements over the average phone error rate of
random selection. No initial model scenario.

Uncertainty sampling and submodular selection were evaluated under two
scenarios. The first scenario we considered is when there is no initial
model available. In this scenario, uncertainty sampling would typically
randomly select a small portion of the unlabeled data to label, and
then train an initial model using these randomly selected data. We did
the following: a) randomly select a% of the training data, acquire the
labels and train an initial model; b) use the learned model to predict
the unlabeled data, select the M most uncertain samples for labelling;
c) retrain the model using all labeled data. If the number of labeled
samples reaches the target amount, stop, else go to step b). We used
a = 1 and M = 100 in the experiments, and the average per-frame
log-likelihood was used as the uncertainty measurement.
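Steps a) to c) can be sketched as a generic loop. The `train` and `score` callables are hypothetical placeholders for model training and for the average per-frame log-likelihood; treating the lowest score as the most uncertain, and truncating the last batch so the target is hit exactly, are assumptions of this sketch rather than details given in the text.

```python
import random

def uncertainty_sampling(pool, train, score, target,
                         init_frac=0.01, batch=100, seed=0):
    """Iterative uncertainty sampling over a pool of utterances.

    train(labeled) -> model; score(model, utterance) -> float, where a
    lower value is taken to mean higher uncertainty (assumption).
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # a) label a random initial portion and train an initial model
    n_init = max(1, int(round(init_frac * len(unlabeled))))
    labeled, unlabeled = unlabeled[:n_init], unlabeled[n_init:]
    model = train(labeled)

    while len(labeled) < target and unlabeled:
        # b) rank the unlabeled data, most uncertain first
        unlabeled.sort(key=lambda u: score(model, u))
        take = min(batch, target - len(labeled))
        labeled += unlabeled[:take]
        unlabeled = unlabeled[take:]
        # c) retrain on all labeled data
        model = train(labeled)
    return labeled, model

# Toy stand-ins: "utterances" are numbers, the "model" is their mean, and
# the score is a unit-variance Gaussian log-likelihood up to a constant.
pool = [float(i) for i in range(200)]
toy_train = lambda labeled: sum(labeled) / len(labeled)
toy_score = lambda model, u: -(u - model) ** 2
chosen, final_model = uncertainty_sampling(
    pool, toy_train, toy_score, target=50, init_frac=0.01, batch=10)
```

In this toy setting the loop repeatedly picks the points farthest from the current mean, which is the outlier-seeking behavior discussed in the introduction.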

For our submodular selection method, HMMs with 16-component GMMs were
obtained by unsupervised training using all the unlabeled data. This
model was used as the generative model for the Fisher score using
gmtkKernel, a GMTK [28] DBN implementation of Fisher kernels. The
negative $\ell_1$ norm was used to construct the graph (we also tested
other measures, which had similar results). The relative PER
improvements over the average of the 100 random experiments are shown
in Figure 1. As we can see, uncertainty sampling achieves improvements
over random sampling in general, but when the target percentage is
small (i.e., 2.5% and 5%), which is usually the case in real-world
applications, it performs similarly to random selection since the model
used for the uncertainty measurement is of low quality. On the other
hand, submodular data selection outperforms both random selection and
uncertainty sampling, especially when the percentage is small. This
implies that even a model trained without any labeling information
works quite well for our approach. In other words, the submodular data
selection approach proposed here is quite robust to the scenario where
no initial "boot" model is available.

Figure 2: Relative improvements over the average phone error rate of
random selection, plotted against the percentage (%) of the training
data. With initial model scenario.

Our second scenario is when an initial model is available to help the
data selection. Such a model should have reasonable quality. In our
experiments, we assume a very high quality initial model to strongly
contrast with our first scenario: an initial model with 16-component
GMM-HMMs was trained on all the labeled TIMIT data, which was then used
in the uncertainty sampling approach, and also in the submodular
selection method as the generative model. The results are shown in
Figure 2. With a better quality initial model, uncertainty sampling
performs better when selecting small percentages of the data but not
necessarily with more data (presumably due to its selection of
unrepresentative outliers). Submodular data selection also performs
better in general with a better quality initial model. In particular,
more than 12% relative improvement over random selection is achieved
when selecting 2.5% of the data. And again, submodular selection
outperforms both random sampling and uncertainty sampling. Also, notice
that there are only relatively minor performance drops in our approach
when shifting from a supervised trained initial model to an
unsupervised trained initial model, illustrating yet again that
submodular selection seems robust to the quality of the initial model.